Skip to content

[v8.0] Introduce Scout Agent and Optimizer#7251

Closed
qdcampagna wants to merge 20 commits into
DIRACGrid:integrationfrom
qdcampagna:ScoutJobs_DIRACv8
Closed

[v8.0] Introduce Scout Agent and Optimizer#7251
qdcampagna wants to merge 20 commits into
DIRACGrid:integrationfrom
qdcampagna:ScoutJobs_DIRACv8

Conversation

@qdcampagna

@qdcampagna qdcampagna commented Oct 20, 2023

Copy link
Copy Markdown

This pull request introduces the Scout Agent and Optimizer. They submit a subset of primary jobs first, then only allows the primary jobs to run once a certain number of Scout jobs are successfully finished. Follow up to #7118. Closes #7083.

BEGINRELEASENOTES

*WorkloadManagement
NEW: Introduce Agent and Optimizer for scouting jobs

ENDRELEASENOTES

@fstagni fstagni left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR.
The main problem I see, that requires change, is that you are calling directly JobDB() to get/set JobParameters. But, JobParameters can be stored also in ElasticSearch (and will only be stored there from DIRAC v9 on). So, use JobStateUpdateClient() instead.

Comment thread src/DIRAC/WorkloadManagementSystem/Agent/ScoutingJobStatusAgent.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Agent/ScoutingJobStatusAgent.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Agent/ScoutingJobStatusAgent.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Agent/ScoutingJobStatusAgent.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Agent/ScoutingJobStatusAgent.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Agent/ScoutingJobStatusAgent.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Agent/ScoutingJobStatusAgent.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Agent/ScoutingJobStatusAgent.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py
Comment thread src/DIRAC/WorkloadManagementSystem/Agent/ScoutingJobStatusAgent.py Outdated
"""Sets defaults
"""

self.jobDB = JobStateUpdateClient()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it was me not being clear, but this can't be a 1-to-1 substitution between JobDB and JobStateUpdateClient, as the 2 don't expose the same methods. For example, JobStateUpdateClient does not expose the selectJobs method used below.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Does JobStateUpdateClient access only the methods found in JobStateUpdateHandler, or are there others?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only those.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I would do this, for added clarity:

Suggested change
self.jobDB = JobStateUpdateClient()
self.jobStateUpdate = JobStateUpdateClient()

then you'll need at least also

        self.jobMonitoring = JobMonitoringClient()

Comment thread src/DIRAC/WorkloadManagementSystem/Agent/ScoutingJobStatusAgent.py Outdated
failedjoblist = []
failedsitelist = []
stalledjoblist = []
scoutjobs = result['Value'].keys()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

beware, this is python3 and not py2 code.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is safer:

Suggested change
scoutjobs = result['Value'].keys()
scoutjobs = list(result['Value'])

Comment thread src/DIRAC/WorkloadManagementSystem/Agent/ScoutingJobStatusAgent.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py
Comment thread src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated

@fstagni fstagni left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, would you add a unit test or two?

Comment thread src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated
Comment thread src/DIRAC/WorkloadManagementSystem/Agent/ScoutingJobStatusAgent.py Outdated

@andresailer andresailer left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some documentation about this somewhere?

From the #7083

"Scout jobs" are created at the job submission -- when a user submits a set of jobs, our client tool makes a smaller set of shorter jobs as "scout jobs" and submits them (the original and the scout) altogether.

Would be interesting to know how this is implemented at the job submission level.

Comment on lines +42 to +48
self.totalScoutJobs = Operations().getValue('WorkloadManagement/Scouting/totalScoutJobs', 10)
self.criteriaFailedRate = Operations().getValue('WorkloadManagement/Scouting/criteriaFailedRate', 0.5)
self.criteriaSucceededRate = Operations().getValue('WorkloadManagement/Scouting/criteriaSucceededRate', 0.3)
self.criteriaStalledRate = Operations().getValue('WorkloadManagement/Scouting/criteriaStalledRate', 1.0)
self.criteriaFailed = Operations().getValue('WorkloadManagement/Scouting/criteriaFailed',
int(self.totalScoutJobs * self.criteriaFailedRate))
self.criteriaSucceeded = Operations().getValue('WorkloadManagement/Scouting/criteriaSucceeded',
int(self.totalScoutJobs * self.criteriaSucceededRate))
self.criteriaStalled = Operations().getValue('WorkloadManagement/Scouting/criteriaStalled',
int(self.totalScoutJobs * self.criteriaStalledRate))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this go via the Agent options, or is this used also in other places?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case you have to add the agent to the ConfigTemplate file?

def execute(self):
"""The ScoutingJobStatus execution method.
"""
result = self.jobDB.selectJobs({'Status': 'Scouting'})

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JobStateUpdateClient doesn't seem to have a method that does what we want here. However, it seems that getJobs() in JobMonitoringHandler might have this functionality. Would you recommend using getJobs, or implementing this in JobStateUpdateHandler?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use getJobs.

"""Sets defaults
"""

self.jobDB = JobStateUpdateClient()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I would do this, for added clarity:

Suggested change
self.jobDB = JobStateUpdateClient()
self.jobStateUpdate = JobStateUpdateClient()

then you'll need at least also

        self.jobMonitoring = JobMonitoringClient()

failedjoblist = []
failedsitelist = []
stalledjoblist = []
scoutjobs = result['Value'].keys()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is safer:

Suggested change
scoutjobs = result['Value'].keys()
scoutjobs = list(result['Value'])


if len(donejoblist) >= criteriaSucceeded:
self.log.verbose(f'Scout (ID = {scoutid}) are done.')
scoutStatus = {'Status': 'Checking', 'MinorStatus': 'Scouting', 'appstatus': 'Scout Complete'}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
scoutStatus = {'Status': 'Checking', 'MinorStatus': 'Scouting', 'appstatus': 'Scout Complete'}
scoutStatus = {'Status': JobStatus.CHECKING, 'MinorStatus': 'Scouting', 'appstatus': 'Scout Complete'}

and similar in the lines below.

Comment on lines +54 to +55
self.jobLog.info('Skipping optimizer, since no scout \
corresponding to this job group')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would print lots of empty spaces. Better to do

Suggested change
self.jobLog.info('Skipping optimizer, since no scout \
corresponding to this job group')
self.jobLog.info('Skipping optimizer, since no scout '
'corresponding to this job group')

result = jobState.getAttribute('RescheduleCounter')
if not result['OK']:
return result
if result['Value'] == None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if result['Value'] == None:
if result['Value'] is None:

@fstagni

fstagni commented Apr 23, 2024

Copy link
Copy Markdown
Contributor

Ping! @qdcampagna

@qdcampagna

Copy link
Copy Markdown
Author

Sorry for the long delay, there were problems with the development server used to test this. We are working on setting up a new one so hopefully soon there will be progress on this.

@fstagni

fstagni commented Oct 31, 2024

Copy link
Copy Markdown
Contributor

Hello, this PR is now 1 year old, and I do not see much movement.

@fstagni

fstagni commented May 13, 2025

Copy link
Copy Markdown
Contributor

Closing because of inaction.

@fstagni fstagni closed this May 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Introducing 'Scouting' Status in WMS state transitions

4 participants